401 research outputs found

    Incorporating source-language paraphrases into phrase-based SMT with confusion networks

    To increase the model coverage, source-language paraphrases have been utilized to boost SMT system performance. Previous work showed that word lattices constructed from paraphrases are able to reduce out-of-vocabulary words and to express inputs in different ways for better translation quality. However, such a word-lattice-based method suffers from two problems: 1) path duplications in word lattices decrease the capacities for potential paraphrases; 2) lattice decoding in SMT dramatically increases the search space and results in poor time efficiency. Therefore, in this paper, we adopt word confusion networks as the input structure to carry source-language paraphrase information. Similar to previous work, we use word lattices to build word confusion networks for merging of duplicated paths and faster decoding. Experiments are carried out on small-, medium- and large-scale English–Chinese translation tasks, and we show that compared with the word-lattice-based method, the decoding time on three tasks is reduced significantly (up to 79%) while comparable translation quality is obtained on the large-scale task.

    Facilitating translation using source language paraphrase lattices

    For resource-limited language pairs, coverage of the test set by the parallel corpus is an important factor that affects translation quality in two respects: 1) out-of-vocabulary words; 2) the same information in an input sentence can be expressed in different ways, while current phrase-based SMT systems cannot automatically select an alternative way to transfer the same information. Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small-, medium- and large-scale English-to-Chinese translation tasks respectively. The results show that the proposed method is effective not only for resource-limited language pairs, but also for resource-sufficient pairs to some extent.

    Improved phrase-based SMT with syntactic reordering patterns learned from lattice scoring

    In this paper, we present a novel approach to incorporate source-side syntactic reordering patterns into phrase-based SMT. The main contribution of this work is to use the lattice scoring approach to exploit and utilize reordering information that is favoured by the baseline PBSMT system. By referring to the parse trees of the training corpus, we represent the observed reorderings with source-side syntactic patterns. The extracted patterns are then used to convert the parsed inputs into word lattices, which contain both the original source sentences and their potential reorderings. Weights of the word lattices are estimated from the observations of the syntactic reordering patterns in the training corpus. Finally, the PBSMT system is tuned and tested on the generated word lattices to show the benefits of adding potential source-side reorderings in the inputs. We confirmed the effectiveness of our proposed method on a medium-sized corpus for a Chinese–English machine translation task. Our method outperformed the baseline system by 1.67% relative on a randomly selected testset and 8.56% relative on the NIST 2008 testset in terms of BLEU score.

    CCG contextual labels in hierarchical phrase-based SMT

    In this paper, we present a method to employ target-side syntactic contextual information in a Hierarchical Phrase-Based system. Our method uses Combinatory Categorial Grammar (CCG) to annotate training data with labels that represent the left and right syntactic context of target-side phrases. These labels are then used to assign labels to nonterminals in hierarchical rules. CCG-based contextual labels help to produce more grammatical translations by forcing phrases which replace nonterminals during translations to comply with the contextual constraints imposed by the labels. We present experiments which examine the performance of CCG contextual labels on Chinese–English and Arabic–English translation in the news and speech expressions domains using different data sizes and CCG-labeling settings. Our experiments show that our CCG contextual labels-based system achieved a 2.42% relative BLEU improvement over a Phrase-Based baseline on Arabic–English translation and a 1% relative BLEU improvement over a Hierarchical Phrase-Based system baseline on Chinese–English translation.

    Source-side syntactic reordering patterns with functional words for improved phrase-based SMT

    Inspired by previous source-side syntactic reordering methods for SMT, this paper focuses on using automatically learned syntactic reordering patterns with functional words which indicate structural reorderings between the source and target language. This approach takes advantage of phrase alignments and source-side parse trees for pattern extraction, and then filters out those patterns without functional words. Word lattices transformed by the generated patterns are fed into PBSMT systems to incorporate potential reorderings from the inputs. Experiments are carried out on a medium-sized corpus for a Chinese–English SMT task. The proposed method outperforms the baseline system by 1.38% relative on a randomly selected testset and 10.45% relative on the NIST 2008 testset in terms of BLEU score. Furthermore, a system with just 61.88% of the patterns filtered by functional words obtains performance comparable to the unfiltered one on the randomly selected testset, and achieves a 1.74% relative improvement on the NIST 2008 testset.

    Lattice score based data cleaning for phrase-based statistical machine translation

    Statistical machine translation relies heavily on parallel corpora to train its models for translation tasks. While more and more bilingual corpora are readily available, the quality of the sentence pairs should be taken into consideration. This paper presents a novel lattice score-based data cleaning method to select proper sentence pairs from the ones extracted from a bilingual corpus by the sentence alignment methods. The proposed method is carried out as follows: firstly, an initial phrase-based model is trained on the full sentence-aligned corpus; then for each of the sentence pairs in the corpus, word alignments are used to create anchor pairs and source-side lattices; thirdly, based on the translation model, target-side phrase networks are expanded on the lattices and Viterbi searching is used to find approximated decoding results; finally, BLEU score thresholds are used to filter out the low-score sentence pairs for the data cleaning purpose. Our experiments on the FBIS corpus showed improvements of BLEU score from 23.78 to 24.02 in Chinese-English.
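
    The final filtering step described above can be illustrated with a minimal sketch. It assumes a score has already been computed for each sentence pair; the function name `filter_by_score`, the example pairs, the scores, and the threshold are all hypothetical and do not come from the paper.

```python
from typing import List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def filter_by_score(
    pairs: List[SentencePair],
    scores: List[float],
    threshold: float,
) -> List[SentencePair]:
    """Keep only the sentence pairs whose score is at least `threshold`."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]


# Hypothetical data: the scores stand in for per-pair BLEU-style values.
pairs = [("src a", "tgt a"), ("src b", "tgt b"), ("src c", "tgt c")]
scores = [0.42, 0.05, 0.31]
kept = filter_by_score(pairs, scores, threshold=0.10)
print(kept)
```

    How the scores are obtained (lattice construction and Viterbi search) is not reproduced here; only the thresholding is shown.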

    The Derived Ring of Differential Operators

    By reading a standard formula for the ring of Grothendieck differential operators in a derived way, we construct a derived (sheaf of) ring of Grothendieck differential operators for Noetherian schemes $X$ separated and finite-type over a base $S$, when the map $X \to S$ is finite tor-amplitude. Using this ring of differential operators, we (re-)develop the theory of $D$-modules from scratch and show an equivalence of categories between $D$-modules using our definition and crystals over the infinitesimal site. Comment: 46 pages

    Grothendieck Duality via Diagonally Supported Sheaves

    Following a formula found in the paper of Avramov, Iyengar, Lipman, and Nayak (2010) and ideas of Neeman and Khusyairi, we indicate that Grothendieck duality for finite tor-amplitude maps can be developed from scratch via the formula $f^! := \delta^*\pi_1^{\times}f^*$. Our strategy centers on the subcategory $\Gamma_{\Delta}(\mathrm{QCoh}(X \times X))$ of quasicoherent sheaves on $X \times X$ supported on the diagonal. By exclusively using this subcategory instead of the full category $\mathrm{QCoh}(X \times X)$ we give systematic categorical proofs of results in Grothendieck duality and reprove many formulas found in Neeman (2018). We also relate some results in Grothendieck duality with properties of the sheaf of (derived) Grothendieck differential operators. Comment: 27 pages

    Financial sentiment analysis using FinBERT with application in predicting stock movement

    We apply sentiment analysis in a financial context using FinBERT, and build a deep neural network model based on LSTM to predict financial market movement. We apply this model to a stock news dataset, and compare its effectiveness to BERT, LSTM and a classical ARIMA model. We find that sentiment is an effective factor in predicting market movement. We also propose several methods to improve the model. Comment: CS224U project

    Lift & Project Systems Performing on the Partial-Vertex-Cover Polytope

    We study integrality gap (IG) lower bounds on strong LP and SDP relaxations derived by the Sherali-Adams (SA), Lovasz-Schrijver-SDP (LS+), and Sherali-Adams-SDP (SA+) lift-and-project (L&P) systems for the t-Partial-Vertex-Cover (t-PVC) problem, a variation of the classic Vertex-Cover problem in which only t edges need to be covered. t-PVC admits a 2-approximation using various algorithmic techniques, all relying on a natural LP relaxation. Starting from this LP relaxation, our main results assert that for every epsilon > 0, level-Theta(n) LPs or SDPs derived by all known L&P systems that have been used for positive algorithmic results (except the Lasserre hierarchy) have IGs at least (1-epsilon)n/t, where n is the number of vertices of the input graph. Our lower bounds are nearly tight. Our results show that restricted yet powerful models of computation derived by many L&P systems fail to witness c-approximate solutions to t-PVC for any constant c, and for t = O(n). This is one of the very few known examples of an intractable combinatorial optimization problem for which LP-based algorithms induce a constant approximation ratio, yet lift-and-project LP and SDP tightenings of the same LP have unbounded IGs. We also show that the SDP that has given the best algorithm known for t-PVC has integrality gap n/t on instances that can be solved by the level-1 LP relaxation derived by the LS system. This constitutes another rare phenomenon where (even in specific instances) a static LP outperforms an SDP that has been used for the best approximation guarantee for the problem at hand. Finally, one of our main contributions is that we make explicit a new and simple methodology for constructing solutions to LP relaxations that almost trivially satisfy constraints derived by all SDP L&P systems known to be useful for algorithmic positive results (except the La system). Comment: 26 pages
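
    The abstract does not write out the "natural LP relaxation" it starts from. As an assumption about what is meant, the relaxation commonly used for partial vertex cover on a graph $G = (V, E)$ has a variable $x_v$ for each vertex and a variable $z_e$ marking each edge as covered:

```latex
\begin{align*}
\min \quad & \sum_{v \in V} x_v \\
\text{s.t.} \quad & x_u + x_v \ge z_e && \forall\, e = \{u, v\} \in E, \\
& \sum_{e \in E} z_e \ge t, \\
& 0 \le x_v \le 1, \quad 0 \le z_e \le 1 && \forall\, v \in V,\ e \in E.
\end{align*}
```

    Setting $z_e = 1$ for every edge recovers the usual Vertex-Cover relaxation; the paper's exact formulation may differ in detail.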